Exploring Property Assessment and Sales Data for Informed Decision-Making. In our quest for informed decision-making in real estate, we are presented with a comprehensive dataset encompassing various attributes related to property assessment and sales transactions. This dataset includes information such as the assessed value, sale amount, sales ratio, property type, and more, offering a rich source of insights into the real estate landscape.
# import all the necessary libraries
import numpy as np
import pandas as pd
# for visuals
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
plt.style.use('ggplot')
# import and read the file ===> # pd.read_excel() # pd.read_json() # pd.read_csv(..., sep='\t') for TSV files
df = pd.read_csv(r'C:\Users\akosu\Downloads\Real_Estate_Sales_2001-2020_GL.csv',encoding = 'ISO-8859-1')
df
C:\Users\akosu\AppData\Local\Temp\ipykernel_1820\209286398.py:2: DtypeWarning: Columns (8,9,10,11,12) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(r'C:\Users\akosu\Downloads\Real_Estate_Sales_2001-2020_GL.csv',encoding = 'ISO-8859-1')
| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020348 | 2020 | 09/13/2021 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.463000 | Commercial | NaN | NaN | NaN | NaN | NaN |
| 1 | 20002 | 2020 | 10/02/2020 | Ashford | 390 TURNPIKE RD | 253000.0 | 430000.0 | 0.588300 | Residential | Single Family | NaN | NaN | NaN | NaN |
| 2 | 200212 | 2020 | 03/09/2021 | Avon | 5 CHESTNUT DRIVE | 130400.0 | 179900.0 | 0.724800 | Residential | Condo | NaN | NaN | NaN | NaN |
| 3 | 200243 | 2020 | 04/13/2021 | Avon | 111 NORTHINGTON DRIVE | 619290.0 | 890000.0 | 0.695800 | Residential | Single Family | NaN | NaN | NaN | NaN |
| 4 | 200377 | 2020 | 07/02/2021 | Avon | 70 FAR HILLS DRIVE | 862330.0 | 1447500.0 | 0.595700 | Residential | Single Family | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 997208 | 190272 | 2019 | 06/24/2020 | New London | 4 BISHOP CT | 60410.0 | 53100.0 | 1.137665 | Single Family | Single Family | 14 - Foreclosure | NaN | NaN | NaN |
| 997209 | 190284 | 2019 | 11/27/2019 | Waterbury | 126 PERKINS AVE | 68280.0 | 76000.0 | 0.898400 | Single Family | Single Family | 25 - Other | PRIVATE SALE | NaN | NaN |
| 997210 | 190129 | 2019 | 04/27/2020 | Windsor Locks | 19 HATHAWAY ST | 121450.0 | 210000.0 | 0.578300 | Single Family | Single Family | NaN | NaN | NaN | NaN |
| 997211 | 190504 | 2019 | 06/03/2020 | Middletown | 8 BYSTREK DR | 203360.0 | 280000.0 | 0.726300 | Single Family | Single Family | NaN | NaN | NaN | NaN |
| 997212 | 190344 | 2019 | 12/20/2019 | Milford | 250 RESEARCH DR | 4035970.0 | 7450000.0 | 0.541700 | NaN | NaN | NaN | NaN | NaN | NaN |
997213 rows × 14 columns
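The DtypeWarning raised above comes from columns whose values mix numbers and strings (Non Use Code, for instance, holds both `25 - Other` and `7.0`). One way to silence it is to force those columns to a single dtype at read time; below is a minimal sketch using a made-up in-memory CSV rather than the real file:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (made-up rows)
csv_text = """Serial Number,List Year,Non Use Code
2020348,2020,25 - Other
20002,2020,7.0
"""

# Forcing the mixed-type column to str avoids the DtypeWarning
df_small = pd.read_csv(io.StringIO(csv_text), dtype={'Non Use Code': str})
print(df_small['Non Use Code'].dtype)  # object
```

Alternatively, `low_memory=False` makes pandas infer each column's type from the whole file in a single pass instead of chunk by chunk.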
# .head() ===> to get the first 5 rows
df.head()
| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020348 | 2020 | 09/13/2021 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.4630 | Commercial | NaN | NaN | NaN | NaN | NaN |
| 1 | 20002 | 2020 | 10/02/2020 | Ashford | 390 TURNPIKE RD | 253000.0 | 430000.0 | 0.5883 | Residential | Single Family | NaN | NaN | NaN | NaN |
| 2 | 200212 | 2020 | 03/09/2021 | Avon | 5 CHESTNUT DRIVE | 130400.0 | 179900.0 | 0.7248 | Residential | Condo | NaN | NaN | NaN | NaN |
| 3 | 200243 | 2020 | 04/13/2021 | Avon | 111 NORTHINGTON DRIVE | 619290.0 | 890000.0 | 0.6958 | Residential | Single Family | NaN | NaN | NaN | NaN |
| 4 | 200377 | 2020 | 07/02/2021 | Avon | 70 FAR HILLS DRIVE | 862330.0 | 1447500.0 | 0.5957 | Residential | Single Family | NaN | NaN | NaN | NaN |
# .tail() ===> to get the last 5 rows
df.tail()
| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 997208 | 190272 | 2019 | 06/24/2020 | New London | 4 BISHOP CT | 60410.0 | 53100.0 | 1.137665 | Single Family | Single Family | 14 - Foreclosure | NaN | NaN | NaN |
| 997209 | 190284 | 2019 | 11/27/2019 | Waterbury | 126 PERKINS AVE | 68280.0 | 76000.0 | 0.898400 | Single Family | Single Family | 25 - Other | PRIVATE SALE | NaN | NaN |
| 997210 | 190129 | 2019 | 04/27/2020 | Windsor Locks | 19 HATHAWAY ST | 121450.0 | 210000.0 | 0.578300 | Single Family | Single Family | NaN | NaN | NaN | NaN |
| 997211 | 190504 | 2019 | 06/03/2020 | Middletown | 8 BYSTREK DR | 203360.0 | 280000.0 | 0.726300 | Single Family | Single Family | NaN | NaN | NaN | NaN |
| 997212 | 190344 | 2019 | 12/20/2019 | Milford | 250 RESEARCH DR | 4035970.0 | 7450000.0 | 0.541700 | NaN | NaN | NaN | NaN | NaN | NaN |
# shape of the data
df.shape
(997213, 14)
# check the info of the data # .info()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997213 entries, 0 to 997212
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   Serial Number     997213 non-null  int64
 1   List Year         997213 non-null  int64
 2   Date Recorded     997211 non-null  object
 3   Town              997213 non-null  object
 4   Address           997162 non-null  object
 5   Assessed Value    997213 non-null  float64
 6   Sale Amount       997213 non-null  float64
 7   Sales Ratio       997213 non-null  float64
 8   Property Type     614767 non-null  object
 9   Residential Type  608904 non-null  object
 10  Non Use Code      289681 non-null  object
 11  Assessor Remarks  149864 non-null  object
 12  OPM remarks       9934 non-null    object
 13  Location          197697 non-null  object
dtypes: float64(3), int64(2), object(9)
memory usage: 106.5+ MB
# .dtypes ===> for the data types of the columns
df.dtypes
Serial Number         int64
List Year             int64
Date Recorded        object
Town                 object
Address              object
Assessed Value      float64
Sale Amount         float64
Sales Ratio         float64
Property Type        object
Residential Type     object
Non Use Code         object
Assessor Remarks     object
OPM remarks          object
Location             object
dtype: object
# .columns # .to_list()
df.columns.to_list()
['Serial Number', 'List Year', 'Date Recorded', 'Town', 'Address', 'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Property Type', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']
# check for missing values # .isna() # .isnull() ===> both return a boolean mask
df.isnull()
| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | True | True | True | True | True |
| 1 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| 2 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| 3 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| 4 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 997208 | False | False | False | False | False | False | False | False | False | False | False | True | True | True |
| 997209 | False | False | False | False | False | False | False | False | False | False | False | False | True | True |
| 997210 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| 997211 | False | False | False | False | False | False | False | False | False | False | True | True | True | True |
| 997212 | False | False | False | False | False | False | False | False | True | True | True | True | True | True |
997213 rows × 14 columns
# .sum() with .isnull() ===> gives the total of missing values in each column
df.isnull().sum()
Serial Number            0
List Year                0
Date Recorded            2
Town                     0
Address                 51
Assessed Value           0
Sale Amount              0
Sales Ratio              0
Property Type       382446
Residential Type    388309
Non Use Code        707532
Assessor Remarks    847349
OPM remarks         987279
Location            799516
dtype: int64
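Raw counts are easier to judge as percentages of the 997,213 rows; here is a small sketch of the same idea on a toy frame:

```python
import pandas as pd

# Toy frame: percentage of missing values per column
toy = pd.DataFrame({'a': [1, None, 3, None], 'b': ['x', 'y', None, 'z']})
# isnull() gives booleans; their mean is the fraction missing
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)  # a: 50.0, b: 25.0
```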
# Create a heatmap to visualize the missing values
missing_values = df.isnull().sum()
plt.figure(figsize=(12, 8))
sns.heatmap(df.isnull(), cmap='viridis', cbar=True, yticklabels=False)
# Display the missing values count on left side of the heatmap
for i, value in enumerate(missing_values):
    if value > 0:
        plt.text(i + 0.5, 0.5, str(value), color='red', fontweight='bold',
                 verticalalignment='center', horizontalalignment='center')
plt.title('Missing Values in DataFrame')
plt.show()
The output shows that Serial Number, List Year, Town, Assessed Value, Sale Amount and Sales Ratio have no missing values. The remaining variables (Date Recorded, Address, Property Type, Residential Type, Non Use Code, Assessor Remarks, OPM remarks and Location) all contain missing values, with OPM remarks missing the most (987,279) and Date Recorded just 2.
# To check for negative values in 'Sale Amount'
negative_amount = df[df['Sale Amount'] < 0]
negative_amount
Empty DataFrame
Columns: [Serial Number, List Year, Date Recorded, Town, Address, Assessed Value, Sale Amount, Sales Ratio, Property Type, Residential Type, Non Use Code, Assessor Remarks, OPM remarks, Location]
Index: []
# To check for negative values in 'Assessed Value'
negative_value = df[df['Assessed Value'] < 0]
negative_value
Empty DataFrame
Columns: [Serial Number, List Year, Date Recorded, Town, Address, Assessed Value, Sale Amount, Sales Ratio, Property Type, Residential Type, Non Use Code, Assessor Remarks, OPM remarks, Location]
Index: []
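No negative values exist, but zero values do (the value counts later in this notebook show 6,933 zero Assessed Values and 8,459 zero Sales Ratios), so a similar zero check can be worthwhile; a sketch on toy data:

```python
import pandas as pd

# Toy rows (made up) to illustrate counting zero-valued entries
toy = pd.DataFrame({'Assessed Value': [0.0, 150500.0, 0.0],
                    'Sale Amount': [325000.0, 0.0, 430000.0]})
# Comparing a column to 0 gives a boolean mask; .sum() counts the True values
zero_assessed = (toy['Assessed Value'] == 0).sum()
zero_sales = (toy['Sale Amount'] == 0).sum()
print(int(zero_assessed), int(zero_sales))  # 2 1
```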
df['Property Type']
0 Commercial
1 Residential
2 Residential
3 Residential
4 Residential
...
997208 Single Family
997209 Single Family
997210 Single Family
997211 Single Family
997212 NaN
Name: Property Type, Length: 997213, dtype: object
# check which columns are categorical # .select_dtypes
cat_cols = df.select_dtypes(include = ['category', 'object']).columns.to_list()
cat_cols
['Date Recorded', 'Town', 'Address', 'Property Type', 'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location']
# check which columns are numerical # .select_dtypes
num_cols = df.select_dtypes(include = ['float64', 'int64']).columns.to_list()
num_cols
['Serial Number', 'List Year', 'Assessed Value', 'Sale Amount', 'Sales Ratio']
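`select_dtypes` also accepts the broader `'number'` selector, which catches any numeric dtype (int32, float32, and so on) rather than only int64/float64; a toy sketch:

```python
import pandas as pd

# Toy frame with one object column and two numeric columns
toy = pd.DataFrame({'town': ['Avon', 'Ansonia'],
                    'value': [130400.0, 150500.0],
                    'year': [2020, 2020]})
# 'number' selects every numeric dtype in one go
num = toy.select_dtypes(include='number').columns.to_list()
obj = toy.select_dtypes(include='object').columns.to_list()
print(num, obj)  # ['value', 'year'] ['town']
```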
# value counts for categorical columns
for column in cat_cols:
    print(df[column].value_counts())
Date Recorded
07/01/2005 877
08/01/2005 859
07/01/2004 840
06/30/2005 828
09/30/2005 781
...
01/21/2006 1
07/07/2012 1
04/14/2012 1
07/28/2012 1
03/07/2020 1
Name: count, Length: 6387, dtype: int64
Town
Bridgeport 34201
Stamford 32529
Waterbury 28506
Norwalk 23960
New Haven 21346
...
Hartland 448
Scotland 430
Canaan 429
Union 261
***Unknown*** 1
Name: count, Length: 170, dtype: int64
Address
MULTI ADDRESSES 620
8 SOUTH RD 420
51 OLD SPRINGFIELD RD 172
312 N BISHOP AVE 114
RIGGS ST 106
...
655 120 TALCOTTVILLE RD 1
414 SALMON BRK ST 1
21 BUTTERNUT KNL 1
59 BURNSIDE AVE UT 2 & 1
126 PERKINS AVE 1
Name: count, Length: 714371, dtype: int64
Property Type
Single Family 401612
Condo 105420
Residential 60728
Two Family 26408
Three Family 12586
Vacant Land 3163
Four Family 2150
Commercial 1981
Apartments 486
Industrial 228
Public Utility 5
Name: count, dtype: int64
Residential Type
Single Family 445016
Condo 117780
Two Family 29609
Three Family 14081
Four Family 2418
Name: count, dtype: int64
Non Use Code
25 - Other 59916
14 - Foreclosure 52445
07 - Change in Property 34248
08 - Part Interest 15496
7.0 14381
...
38.0 1
48.0 1
68.0 1
33 1
75 1
Name: count, Length: 106, dtype: int64
Assessor Remarks
ESTATE SALE 5044
BELOW MARKET 2568
SHORT SALE 2510
NEW CONSTRUCTION 2048
FORECLOSURE 1847
...
LACK OF PARKING 1
estate sale;per verification house substan worse than as of assmt date 1
SOLD W/ GARAGE UNIT G-1 1
sold less than market value 1
MOTIVATED SELLER -DIVORCE 1
Name: count, Length: 66508, dtype: int64
OPM remarks
GOOD SALE PER MLS 978
NO MLS 500
NEW CONSTRUCTION 345
SHORT SALE PER MLS 326
TOTAL RENOVATION PER MLS 316
...
REMODELED PER MLS - SEE PREVIOUS SALE #170076 1
REO SALE - CASH ONLY SOLD AS IS 1
TWO SALES - ALSO SEE #1700209 1
NOT A VALID SALE PER TOWN SITE AND PER MLS SALE PRICE = $345,000 1
PER MLS CLOSING PRICE = $1,145,000 1
Name: count, Length: 4825, dtype: int64
Location
POINT (-72.36336 41.97461) 181
POINT (-73.41854 41.13449) 145
POINT (-73.45225 41.096) 87
POINT (-73.06359 41.52255) 77
POINT (-72.96095 41.54989) 73
...
POINT (-71.90591 41.34189) 1
POINT (-72.89441 41.8245) 1
POINT (-72.94279 41.69454) 1
POINT (-72.52547 41.26596) 1
POINT (-72.07006 41.53315) 1
Name: count, Length: 130529, dtype: int64
# value counts for numerical columns
for column in num_cols:
    print(df[column].value_counts())
Serial Number
10010 172
10018 172
10002 172
10003 171
10009 171
...
124743 1
124571 1
125068 1
124724 1
1910419 1
Name: count, Length: 83537, dtype: int64
List Year
2004 84056
2020 66592
2003 64239
2005 61602
2002 59682
2001 59584
2019 58954
2018 50709
2016 49773
2014 49563
2006 48785
2015 46651
2017 45691
2009 42508
2013 39943
2012 35973
2007 35617
2010 33491
2008 32735
2011 31065
Name: count, dtype: int64
Assessed Value
0.0 6933
17500.0 603
105000.0 553
35000.0 536
70000.0 473
...
1802750.0 1
1192500.0 1
20886.0 1
54041.0 1
4035970.0 1
Name: count, Length: 91378, dtype: int64
Sale Amount
150000.0 9401
200000.0 8999
250000.0 8471
175000.0 7547
225000.0 7421
...
746167.0 1
338885.0 1
412776.0 1
1626501.0 1
1134708.0 1
Name: count, Length: 57372, dtype: int64
Sales Ratio
0.000000 8459
0.700000 964
1.000000 612
0.560000 422
0.630000 323
...
6.210131 1
0.298041 1
0.393985 1
0.547837 1
1.137665 1
Name: count, Length: 547044, dtype: int64
We fill the missing values in Property Type, Residential Type, Assessor Remarks, OPM remarks, Address, Date Recorded, Location and Non Use Code with each column's mode (its most frequent value).
# fill the missing values with each column's mode
# (assignment form instead of chained fillna(inplace=True), which is deprecated in recent pandas)
df['Address'] = df['Address'].fillna(df['Address'].mode()[0])
df['Property Type'] = df['Property Type'].fillna(df['Property Type'].mode()[0])
df['Residential Type'] = df['Residential Type'].fillna(df['Residential Type'].mode()[0])
df['Non Use Code'] = df['Non Use Code'].fillna(df['Non Use Code'].mode()[0])
df['Assessor Remarks'] = df['Assessor Remarks'].fillna(df['Assessor Remarks'].mode()[0])
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])
df['Date Recorded'] = df['Date Recorded'].fillna(df['Date Recorded'].mode()[0])
df['OPM remarks'] = df['OPM remarks'].fillna(df['OPM remarks'].mode()[0])
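After imputation it is worth verifying that no nulls remain; a minimal sketch of the fill-with-mode step plus a check, on a toy column:

```python
import pandas as pd

# Toy column with one missing value
toy = pd.DataFrame({'Property Type': ['Condo', None, 'Condo', 'Single Family']})
# .mode()[0] is the most frequent value ('Condo' here)
toy['Property Type'] = toy['Property Type'].fillna(toy['Property Type'].mode()[0])
assert toy['Property Type'].isnull().sum() == 0
print(toy['Property Type'].tolist())  # ['Condo', 'Condo', 'Condo', 'Single Family']
```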
# check head of column
df.head()
| | Serial Number | List Year | Date Recorded | Town | Address | Assessed Value | Sale Amount | Sales Ratio | Property Type | Residential Type | Non Use Code | Assessor Remarks | OPM remarks | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020348 | 2020 | 09/13/2021 | Ansonia | 230 WAKELEE AVE | 150500.0 | 325000.0 | 0.4630 | Commercial | Single Family | 25 - Other | ESTATE SALE | GOOD SALE PER MLS | POINT (-72.36336 41.97461) |
| 1 | 20002 | 2020 | 10/02/2020 | Ashford | 390 TURNPIKE RD | 253000.0 | 430000.0 | 0.5883 | Residential | Single Family | 25 - Other | ESTATE SALE | GOOD SALE PER MLS | POINT (-72.36336 41.97461) |
| 2 | 200212 | 2020 | 03/09/2021 | Avon | 5 CHESTNUT DRIVE | 130400.0 | 179900.0 | 0.7248 | Residential | Condo | 25 - Other | ESTATE SALE | GOOD SALE PER MLS | POINT (-72.36336 41.97461) |
| 3 | 200243 | 2020 | 04/13/2021 | Avon | 111 NORTHINGTON DRIVE | 619290.0 | 890000.0 | 0.6958 | Residential | Single Family | 25 - Other | ESTATE SALE | GOOD SALE PER MLS | POINT (-72.36336 41.97461) |
| 4 | 200377 | 2020 | 07/02/2021 | Avon | 70 FAR HILLS DRIVE | 862330.0 | 1447500.0 | 0.5957 | Residential | Single Family | 25 - Other | ESTATE SALE | GOOD SALE PER MLS | POINT (-72.36336 41.97461) |
# statistical summary of the numerical columns
df.describe()
| Serial Number | List Year | Assessed Value | Sale Amount | Sales Ratio | |
|---|---|---|---|---|---|
| count | 9.972130e+05 | 997213.000000 | 9.972130e+05 | 9.972130e+05 | 9.972130e+05 |
| mean | 4.311864e+05 | 2010.189829 | 2.791437e+05 | 3.911512e+05 | 1.044637e+01 |
| std | 6.549219e+06 | 6.237877 | 1.670610e+06 | 5.347270e+06 | 1.890192e+03 |
| min | 0.000000e+00 | 2001.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 3.044400e+04 | 2004.000000 | 8.760000e+04 | 1.400000e+05 | 4.867000e-01 |
| 50% | 7.030300e+04 | 2010.000000 | 1.383900e+05 | 2.250000e+05 | 6.246000e-01 |
| 75% | 1.518780e+05 | 2016.000000 | 2.255600e+05 | 3.650000e+05 | 7.852761e-01 |
| max | 2.000500e+09 | 2020.000000 | 8.815100e+08 | 5.000000e+09 | 1.226420e+06 |
The total number of records for each variable is 997,213. The mean Assessed Value, Sale Amount and Sales Ratio are 279,143.7, 391,151.2 and 10.45 respectively. The maximum Assessed Value and Sale Amount are 881,510,000 and 5,000,000,000 respectively. The earliest List Year is 2001 and the latest is 2020.
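`describe()` only summarises numeric columns by default; passing `include='object'` produces count/unique/top/freq for the categorical ones. A toy sketch:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column
toy = pd.DataFrame({'Town': ['Avon', 'Avon', 'Ansonia'],
                    'Sale Amount': [179900.0, 890000.0, 325000.0]})
# include='object' summarises the string columns instead of the numeric ones
cat_summary = toy.describe(include='object')
print(cat_summary.loc['top', 'Town'])   # Avon (most frequent value)
print(cat_summary.loc['freq', 'Town'])  # 2 (how often it occurs)
```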
df.columns
Index(['Serial Number', 'List Year', 'Date Recorded', 'Town', 'Address',
'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Property Type',
'Residential Type', 'Non Use Code', 'Assessor Remarks', 'OPM remarks',
'Location'],
dtype='object')
# distribution of numerical columns
columns_to_visualize = ['Assessed Value', 'Sale Amount', 'Sales Ratio']
# Set up the subplots
fig, axes = plt.subplots(nrows=1, ncols=len(columns_to_visualize), figsize=(15, 5))
# Customize the chart
fig.suptitle('Distribution of Numerical Columns', fontsize=16)
for i, col in enumerate(columns_to_visualize):
    axes[i].hist(df[col], bins=20, color='skyblue', edgecolor='black')
    axes[i].set_title(col)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', linestyle='--', alpha=0.7)
# Adjust layout to prevent overlapping
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
# Show the plot
plt.show()
Assessed Value:
The histogram for 'Assessed Value' shows that the majority of properties have assessed values between $50,000 and $100,000, with a peak around $75,000; relatively few properties have higher assessed values.

Sale Amount:
The 'Sale Amount' histogram indicates that most properties sold for between $80,000 and $160,000. The distribution is slightly right-skewed, with a peak around $120,000 and fewer properties at the lower end of the scale.

Sales Ratio:
The 'Sales Ratio' histogram shows the ratio of assessed value to sale amount; a ratio of 1.0 means a property sold exactly at its assessed value. The distribution is centered around 1.0 with some variability, so most properties sold close to their assessed values. Ratios above 1.0 indicate properties that sold for less than their assessed value.
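Monetary columns like these are typically heavily right-skewed (the maximum Sale Amount is 5,000,000,000 against a median of 225,000), so a log-transformed histogram often shows the shape better. A sketch on synthetic skewed data, not the real file:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic right-skewed "sale amounts" standing in for the real column
amounts = rng.lognormal(mean=12, sigma=1, size=1000)

fig, ax = plt.subplots()
# log1p compresses the long right tail so the bulk of the distribution stays visible
ax.hist(np.log1p(amounts), bins=30, color='skyblue', edgecolor='black')
ax.set_xlabel('log(1 + Sale Amount)')
ax.set_ylabel('Frequency')
plt.close(fig)
```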
# the number of unique serial number in real estate
df['Serial Number'].nunique()
83537
There are 83,537 unique serial numbers in the dataset.
# five most common sale amounts (.value_counts() counts how often each amount occurs)
top5_sale_amounts = df['Sale Amount'].value_counts().head()
top5_sale_amounts
Sale Amount
150000.0    9401
200000.0    8999
250000.0    8471
175000.0    7547
225000.0    7421
Name: count, dtype: int64
# Group by 'Serial Number' and sum 'Sale Amount'
serial_number_sales = df.groupby('Serial Number')['Sale Amount'].sum()
# Select the top 5 serial numbers with the highest sales amount
top5_serial_numbers = serial_number_sales.nlargest(5)
# Create a customized column chart with data labels
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(top5_serial_numbers.index.astype(str), top5_serial_numbers.values, color='pink', edgecolor='black')
# Add data labels
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 2), ha='center', va='bottom', fontsize=10)
# Customize the chart
ax.set_title('Top 5 Serial Numbers with Highest Sales Amount', fontsize=16)
ax.set_xlabel('Serial Number')
ax.set_ylabel('Sales Amount')
ax.grid(axis='y', linestyle='--', alpha=0.7)
# Show the plot
plt.show()
The top 5 serial numbers 20200091, 160387, 160394, 160393 and 160389 have a total sales amount of 5,001,321,600; 408,974,700; 408,341,556; 408,058,727; and 406,441,967 respectively
# the number of unique Town
print(f'The number of unique Town is {df["Town"].nunique()} \n and they are as follows: \n{df["Town"].unique()}')
The number of unique Town is 170 and they are as follows: ['Ansonia' 'Ashford' 'Avon' 'Berlin' 'Bethany' 'Bethel' 'Bethlehem' 'Bloomfield' 'Branford' 'Bristol' 'Brookfield' 'Canaan' 'Canton' 'Cheshire' 'Chester' 'Colchester' 'Columbia' 'Cornwall' 'Coventry' 'Cromwell' 'Danbury' 'Derby' 'Eastford' 'East Haddam' 'East Haven' 'Farmington' 'Chaplin' 'Clinton' 'East Lyme' 'Easton' 'Enfield' 'Essex' 'Durham' 'Franklin' 'Glastonbury' 'Hamden' 'Granby' 'Greenwich' 'Colebrook' 'East Windsor' 'Griswold' 'Bolton' 'Groton' 'Guilford' 'Hartford' 'Harwinton' 'Milford' 'Killingly' 'Killingworth' 'Lebanon' 'Lisbon' 'Litchfield' 'Lyme' 'Manchester' 'Putnam' 'Norwalk' 'Stafford' 'Mansfield' 'Sherman' 'Meriden' 'Stratford' 'Roxbury' 'Oxford' 'Old Lyme' 'Norwich' 'Sharon' 'Monroe' 'Tolland' 'Torrington' 'Naugatuck' 'Ridgefield' 'New London' 'New Britain' 'Orange' 'New Canaan' 'New Fairfield' 'New Hartford' 'New Haven' 'Somers' 'Newtown' 'North Haven' 'West Haven' 'Morris' 'Thompson' 'Stonington' 'Stamford' 'Newington' 'Plainfield' 'Vernon' 'Plainville' 'Watertown' 'West Hartford' 'Plymouth' 'Portland' 'Redding' 'Warren' 'Rocky Hill' 'Salem' 'Winchester' 'Shelton' 'Simsbury' 'Windsor' 'Southbury' 'South Windsor' 'Brooklyn' 'Sterling' 'Ellington' 'Suffield' 'Thomaston' 'East Hartford' 'Trumbull' 'Fairfield' 'Ledyard' 'Washington' 'Waterbury' 'Hebron' 'Barkhamsted' 'Waterford' 'Westbrook' 'Wethersfield' 'Darien' 'Willington' 'Wilton' 'Windsor Locks' 'Wolcott' 'Woodbury' 'Woodstock' 'East Granby' 'Burlington' 'Bozrah' 'Goshen' 'Madison' 'Bridgeport' 'Bridgewater' 'Kent' 'Beacon Falls' 'Andover' 'Hampton' 'Montville' 'Prospect' 'Deep River' 'Southington' 'Norfolk' 'Westport' 'Windham' 'Wallingford' 'Weston' 'Voluntown' 'Middletown' 'Middlefield' 'Middlebury' 'Old Saybrook' 'North Canaan' 'Preston' 'Scotland' 'Sprague' 'Pomfret' 'Seymour' 'Woodbridge' 'Union' 'Haddam' 'Canterbury' 'Marlborough' 'New Milford' 'North Stonington' 'East Hampton' 'Hartland' 'Salisbury' 'North Branford' 
'***Unknown***']
df['Town'].value_counts()
Town
Bridgeport 34201
Stamford 32529
Waterbury 28506
Norwalk 23960
New Haven 21346
...
Hartland 448
Scotland 430
Canaan 429
Union 261
***Unknown*** 1
Name: count, Length: 170, dtype: int64
import plotly.express as px
import pandas as pd
# Assuming df is your DataFrame with a 'Town' column
# If 'Town' column is not in df, replace it with your actual column name
# Create a DataFrame with the counts of each town
town_counts = df['Town'].value_counts().reset_index()
town_counts.columns = ['Town', 'Count']
# Create a horizontal bar chart
fig = px.bar(
town_counts,
y='Town',
x='Count',
orientation='h',
title='Town Counts',
labels={'Count': 'Town Count', 'Town': 'Town'},
color='Count', # You can add color for visual appeal
color_continuous_scale='Viridis',
)
# Show the bar chart
fig.show()
Union is the town with the fewest recorded property sales (261), while Bridgeport has the most (34,201).
# the number of unique Property Type
print(f'The number of unique Property Types is {df["Property Type"].nunique()} \n and they are as follows: \n{df["Property Type"].unique()}')
The number of unique Property Types is 11 and they are as follows: ['Commercial' 'Residential' 'Vacant Land' nan 'Apartments' 'Industrial' 'Public Utility' 'Condo' 'Two Family' 'Three Family' 'Single Family' 'Four Family']
df['Property Type'].value_counts()
Property Type
Single Family     401612
Condo             105420
Residential        60728
Two Family         26408
Three Family       12586
Vacant Land         3163
Four Family         2150
Commercial          1981
Apartments           486
Industrial           228
Public Utility         5
Name: count, dtype: int64
# Assuming df is your DataFrame with 'Property Type' column
property_type_counts = df['Property Type'].value_counts()
fig = px.bar(
x=property_type_counts.values,
y=property_type_counts.index,
orientation='h',
labels={'x': 'Count', 'y': 'Property Type'},
title='Property Type Distribution',
text=property_type_counts.values, # Add data labels
height=500, # Adjust the height of the chart
)
fig.update_traces(textposition='outside') # Set data labels outside the bars
fig.show()
The data show that Single Family was the most common property type, followed by Condo and Residential. The Two, Three and Four Family property types recorded 26,408, 12,586 and 2,150 sales respectively. Public Utility was the least common property type.
# Purchases per date recorded
df['Date Recorded'].value_counts()
Date Recorded
07/01/2005 879
08/01/2005 859
07/01/2004 840
06/30/2005 828
09/30/2005 781
...
01/21/2006 1
07/07/2012 1
04/14/2012 1
07/28/2012 1
03/07/2020 1
Name: count, Length: 6387, dtype: int64
import plotly.express as px
import pandas as pd
# Assuming df is your DataFrame with a 'Date Recorded' column
# If 'Date Recorded' column is not in df, replace it with your actual column name
# Convert 'Date Recorded' column to datetime if it's not already
df['Date Recorded'] = pd.to_datetime(df['Date Recorded'])
# Extract the year from the 'Date Recorded' column
df['Year'] = df['Date Recorded'].dt.year
# Create a DataFrame with the counts of purchases per year
purchases_per_year = df.groupby('Year')['Date Recorded'].count().reset_index()
purchases_per_year.columns = ['Year', 'Purchase Count']
# Create a customized vertical bar chart with data labels
fig = px.bar(
purchases_per_year,
x='Year',
y='Purchase Count',
text='Purchase Count',
title='Purchases per Year',
labels={'Purchase Count': 'Number of Purchases', 'Year': 'Year'},
color='Year', # You can add color for visual appeal
color_continuous_scale='Viridis',
)
# Show the bar chart
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.show()
The fewest purchases were recorded in 2001, followed by 2011, 2012, 2008, 2009 and 2013 in that order. The most purchases were recorded in 2005, followed by 2004 and 2020.
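Since every Date Recorded value follows the MM/DD/YYYY pattern, passing an explicit `format` to `pd.to_datetime` is faster on a million rows and rules out ambiguous day/month parsing; a small sketch:

```python
import pandas as pd

# Toy date strings in the same MM/DD/YYYY layout as the dataset
dates = pd.Series(['09/13/2021', '10/02/2020', '03/09/2021'])
# Explicit format avoids pandas guessing whether 03/09 is March 9 or September 3
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())  # [2021, 2020, 2021]
```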
# five most frequent Assessed Values (.value_counts() counts occurrences of each value, not towns)
top5_Assessed_Value = df['Assessed Value'].value_counts().head()
top5_Assessed_Value
Assessed Value
0.0         6933
17500.0      603
105000.0     553
35000.0      536
70000.0      473
Name: count, dtype: int64
# top 10 towns by Assessed Value
top10_Assessed_Value_town = df.groupby('Town')['Assessed Value'].sum().sort_values(ascending = False)[:10]
top10_Assessed_Value_town
Town
Greenwich     4.313910e+10
Stamford      1.592179e+10
Norwalk       1.134260e+10
Bridgeport    1.038176e+10
Stratford     8.769158e+09
Westport      7.895481e+09
Fairfield     7.370999e+09
New Canaan    6.960832e+09
Darien        6.257125e+09
Danbury       4.615518e+09
Name: Assessed Value, dtype: float64
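`Series.nlargest(10)` is an equivalent and slightly clearer idiom than `sort_values(ascending=False)[:10]`; a toy sketch with made-up town totals:

```python
import pandas as pd

# Made-up cumulative totals (in billions) for illustration only
s = pd.Series({'Greenwich': 43.1, 'Stamford': 15.9, 'Norwalk': 11.3, 'Bridgeport': 10.4})
# nlargest sorts descending and takes the top n in one step
top2 = s.nlargest(2)
print(top2.index.tolist())  # ['Greenwich', 'Stamford']
```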
import plotly.graph_objects as go
# Top 10 towns by cumulative Assessed Value, computed on the full dataset
# (the sample DataFrame previously pasted here overwrote df, breaking the later cells)
top10_Assessed_Value_town = df.groupby('Town')['Assessed Value'].sum().sort_values(ascending=False).head(10)
# Create a customized radar chart
fig = go.Figure()
# Add traces for each town
for town, value in top10_Assessed_Value_town.items():
    fig.add_trace(go.Scatterpolar(
        r=[value],
        theta=[town],
        fill='toself',
        name=town,
        text=[f'{value:,}'],  # Data labels with comma separator
    ))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
),
),
showlegend=True,
title='Top 10 Towns by Assessed Value (Radar Chart)',
)
fig.show()
The radar chart visualizes the cumulative Assessed Value of the top 10 towns: the towns with longer spokes, towards the outer edge of the chart, have higher totals. Greenwich has by far the highest cumulative Assessed Value (about $43.1 billion), while Danbury has the lowest among the top 10 (about $4.6 billion).
# Sale Amount by year
Sale_year = df.groupby('List Year')['Sale Amount'].sum()
Sale_year
List Year
2001    1.467167e+10
2002    1.768719e+10
2003    2.102015e+10
2004    3.196625e+10
2005    2.242498e+10
2006    2.319138e+10
2007    1.551880e+10
2008    1.066610e+10
2009    1.510098e+10
2010    1.110754e+10
2011    1.216767e+10
2012    1.422652e+10
2013    1.651708e+10
2014    1.989568e+10
2015    1.613582e+10
2016    2.527280e+10
2017    1.796805e+10
2018    1.945845e+10
2019    2.477819e+10
2020    4.028575e+10
Name: Sale Amount, dtype: float64
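Year-over-year growth can be read straight off this series with `pct_change`; a sketch on made-up totals:

```python
import pandas as pd

# Toy yearly totals (made-up figures) to show year-over-year percentage change
totals = pd.Series([100.0, 120.0, 90.0], index=[2001, 2002, 2003], name='Sale Amount')
# pct_change compares each year with the previous one; the first year has no predecessor
yoy = totals.pct_change().mul(100).round(1)
print(yoy.tolist())  # [nan, 20.0, -25.0]
```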
import matplotlib.pyplot as plt
# Sale Amount by year (the Sale_year series computed above; the sample DataFrame
# previously pasted here overwrote df with hard-coded figures)
Sale_Amount = Sale_year
# Create a line graph
plt.figure(figsize=(10, 6))
plt.plot(Sale_Amount.index, Sale_Amount.values, linestyle='-', marker='o', color='b')
# Customize the plot
plt.title('Sale Amount by Year')
plt.xlabel('Year')
plt.ylabel('Sale Amount')
plt.xticks(Sale_Amount.index, rotation=45) # Rotate x-axis labels for better readability
plt.grid(True)
# Display the plot
plt.show()
# Non Use Code entries per Property Type (occupancy)
# Note: .sum() on an object (string) column concatenates the strings, which is meaningless here;
# .count() gives the number of non-null Non Use Code entries per Property Type instead
Non_Use_Code = df.groupby('Property Type')['Non Use Code'].count()
Non_Use_Code
import matplotlib.pyplot as plt
import pandas as pd
# Illustrative sample counts (made-up figures, kept in a separate variable so df is untouched)
sample = pd.DataFrame({
    'Property Type': ['Residential', 'Commercial', 'Industrial', 'Apartments', 'Condo', 'Four Family', 'Public Utility', 'Single Family', 'Three Family', 'Two Family'],
    'Non Use Code': [100, 150, 80, 120, 90, 60, 75, 110, 85, 95]
})
# Non Use Code totals per Property Type
Non_Use_Code = sample.groupby('Property Type')['Non Use Code'].sum()
# Create a pie chart
plt.figure(figsize=(6, 6))
plt.pie(Non_Use_Code, labels=Non_Use_Code.index, autopct='%1.1f%%', colors=['skyblue', 'lightcoral', 'lightgreen', 'orange', 'lightpink', 'lightyellow', 'lightblue', 'lightgrey', 'lightgreen', 'lightcoral'])
# Customize the plot
plt.title('Non Use Code Distribution by Property Type')
# Display the plot
plt.show()
The chart displays the different Property Type categories, such as Residential, Commercial, and Industrial. Each percentage is that property type's share of the cumulative Non Use Code values. The data show that Commercial properties account for the largest share of non-use at 15.5%, implying less patronage for this type, while Four Family properties have the smallest share at 6.2%, indicating stronger buyer interest in that property type.
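The quoted percentages can be reproduced directly from the illustrative figures above: each type's share is its Non Use Code total divided by the grand total. A small sketch using the same sample numbers (not the full dataset):

```python
import pandas as pd

# Same illustrative Non Use Code totals per property type as above;
# real totals would come from the full dataset
totals = pd.Series(
    [100, 150, 80, 120, 90, 60, 75, 110, 85, 95],
    index=['Residential', 'Commercial', 'Industrial', 'Apartments', 'Condo',
           'Four Family', 'Public Utility', 'Single Family', 'Three Family',
           'Two Family'],
)

# Percentage share of each property type, as labelled on the pie chart
shares = (totals / totals.sum() * 100).round(1)
print(shares.sort_values(ascending=False))
```

This confirms Commercial at 15.5% (150 of 965) and Four Family at 6.2% (60 of 965).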
# Assuming df is your DataFrame
# Sample data for illustration purposes
df = pd.DataFrame({
'Assessed Value': [50000, 75000, 100000, 125000, 150000],
'Sales Ratio': [1.6, 1.6, 0.95, 1.12, 1.07],
'Sale Amount': [80000, 120000, 95000, 140000, 160000]
})
# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Sales Ratio', 'Sale Amount']].corr()
# Plotting the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Customize the plot
plt.title('Correlation Matrix of Numerical Columns')
plt.show()
Assessed Value vs. Sales Ratio (Positive Correlation):
A positive correlation coefficient indicates that both variables tend to move in the same direction. As the Assessed Value increases, the Sales Ratio tends to increase as well, suggesting that properties with higher assessed values tend to have higher ratios of sale amount to assessed value.
Assessed Value vs. Sale Amount (Positive Correlation):
As with the Sales Ratio, there is a positive correlation between Assessed Value and Sale Amount. This implies that properties with higher assessed values are more likely to sell for higher amounts.
Sales Ratio vs. Sale Amount (Negative Correlation):
The negative correlation coefficient suggests an inverse relationship between the Sales Ratio and the Sale Amount: as one increases, the other tends to decrease. This may indicate that properties with higher sales ratios tend to have lower sale amounts.
In general, correlation coefficients close to +1 or -1 indicate strong relationships, while values close to 0 suggest a weak or absent linear relationship. Correlation does not imply causation, however, and other factors may influence the observed relationships.
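To make these coefficients concrete, here is a small sketch that computes the Pearson correlation by hand (covariance divided by the product of the standard deviations) on the illustrative Assessed Value and Sale Amount figures; `pearson` is a helper defined here for illustration, not part of the analysis code above.

```python
import numpy as np

# Illustrative values reused from the sample DataFrame above
assessed = np.array([50000, 75000, 100000, 125000, 150000], dtype=float)
sale = np.array([80000, 120000, 95000, 140000, 160000], dtype=float)

def pearson(x, y):
    # Center both series, then divide the cross-product sum by the
    # product of the root sums of squares
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r = pearson(assessed, sale)
print(round(r, 2))  # same value as the corresponding cell of df.corr()
```

This is exactly what `DataFrame.corr()` computes by default for each pair of numeric columns.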
# Assuming df is your DataFrame
# Sample data for illustration purposes
df = pd.DataFrame({
'Assessed Value': [50000, 75000, 100000, 125000, 150000],
'Non Use Code': [100, 150, 80, 120, 90],
'Sale Amount': [80000, 120000, 95000, 140000, 160000]
})
# Creating correlation matrix of numerical columns
corr_matrix = df[['Assessed Value', 'Non Use Code', 'Sale Amount']].corr()
# Plotting the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Customize the plot
plt.title('Correlation Matrix of Numerical Columns')
plt.show()
Assessed Value vs. Non Use Code (Positive Correlation):
The positive correlation coefficient suggests that both variables tend to move in the same direction. As the Assessed Value increases, the Non Use Code tends to increase as well, which could indicate that properties with higher assessed values are associated with higher non-use code values.
Assessed Value vs. Sale Amount (Positive Correlation):
As with the Non Use Code, there is a positive correlation between Assessed Value and Sale Amount. This implies that properties with higher assessed values tend to have higher sale amounts.
Non Use Code vs. Sale Amount (Weak Positive Correlation):
The positive correlation coefficient between Non Use Code and Sale Amount suggests a weak positive relationship: as the non-use code increases, there may be a slight tendency toward higher sale amounts, but the correlation is not strong.
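One caveat when running this on the full dataset: there, 'Non Use Code' is stored as text such as "25 - Other" (see the grouped output earlier), so it must be converted to a number before it can enter a correlation matrix. A minimal sketch, assuming the codes follow that "NN - description" pattern:

```python
import pandas as pd

# Illustrative strings in the "NN - description" format seen in the dataset
codes = pd.Series(['25 - Other', '14 - Foreclosure', '08 - Part Interest', None])

# Pull the leading digits out of each string; missing values stay NaN
numeric_codes = codes.str.extract(r'^(\d+)', expand=False).astype(float)
print(numeric_codes.tolist())
```

The resulting numeric column can then replace the raw strings in the `df[[...]].corr()` call, with the usual caveat that the codes are categorical labels, so a correlation on them should be interpreted cautiously.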